System and method for organizing data

ABSTRACT

A system and method for organizing raw data from one or more sources uses an improved mechanism for identifying duplicate data between fields (e.g., columns) in the databases. The fields may be similar fields within a single database or similar or identical fields within a pair of databases, and are organized as arrays or field vectors. The present invention sorts each of the field vectors and, if necessary, partitions them by common value. The number of comparisons required to identify the duplicate data between the field vectors is reduced by feeding back a difference between the compared values. This difference is used to adjust indices into the field vectors for subsequent comparison.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation application of co-pending application Ser. No. 13/775,489, filed Feb. 25, 2013; which application is a continuation application of application Ser. No. 10/219,658, filed Aug. 16, 2002, now abandoned; which application is a continuation application of co-pending application Ser. No. 09/412,970, filed Oct. 6, 1999, now U.S. Pat. No. 6,457,006; which application is a continuation-in-part application of co-pending application Ser. No. 09/357,301, filed Jul. 20, 1999, now U.S. Pat. No. 6,424,969.

The specification and drawings of each of the aforementioned applications are incorporated herein in their entirety, by this reference.

BACKGROUND

1. Field of the Invention

The present invention relates to database systems and more particularly, to a system and method for organizing data in a database system.

2. Discussion of the Related Art

Computerized database systems have long been used and their basic concepts are well known. A good introduction to database systems may be found in C. J. DATE, INTRODUCTION TO DATABASE SYSTEMS (Addison Wesley, 6th ed. 1994).

In general, database systems are designed to organize, store and retrieve data in such a way that the data in the database is useful. For example, the data, or partitioned sets of the data, may be searched, sorted, organized and/or combined with other data. To a large extent the usefulness of a particular database system is dependent on the integrity (i.e., the accuracy and/or correctness) of the data in the database system. Data integrity is affected by the degree of “disorder” in the data stored. Disorder may occur in the form of erroneous or incomplete data such as duplicate data, fragmented data, false data, etc. In many database systems, from time to time, existing data may be edited and processed, and as a result, additional errors may be introduced. In some database systems, new data may be introduced. Additionally, as database systems are upgraded with new hardware and/or software, data conversion may be required or additional fields may become necessary. Furthermore, in some applications, the data in the database may simply become outdated over time.

Regardless of the preventative steps taken, some degree of disorder is eventually introduced in conventional database systems. This degree of disorder increases exponentially over time until eventually, the data in a conventional database becomes entirely useless. As a result, even a small degree of disorder eventually affects the integrity of the database system.

Unfortunately, identifying and correcting disorder in the data are often difficult, if not impossible, tasks particularly in large database systems. Traditionally, such tasks are performed manually, making these tasks time-consuming, expensive, and subject to human error. Furthermore, due to the very nature of the task, much of the disorder may go largely undetected. What is needed is a system and method for organizing data in a database system to overcome these and other associated problems.

SUMMARY OF THE INVENTION

The present invention provides a system and method for organizing data in a database system. The present invention derives a distilled database of accurate data from raw data included in one or more raw data sources. The raw data is converted from its original format(s) to a numeric format. According to one embodiment of the present invention, the raw data is represented as a vector having numeric elements. Once the raw data is represented numerically, various mathematical operations such as correlation functions, pattern recognition methods, or other similar numeric methods, may be performed on these vectors to determine how content in a particular vector corresponds to other vectors in a “distilled” or reference database. The distilled database is formed from sets of one or more related vectors that are believed to be unique (e.g., orthogonal) with respect to the other sets. These sets represent the best information available from the raw data. After all the raw data has been incorporated into the distilled database, new data may be screened to ensure that new errors are not introduced into the distilled database. The new data may also be evaluated to determine whether it is unique or whether it includes better information than that already present in the distilled database. The new data is added to the distilled database accordingly.

One of the features of the present invention is that raw data is converted into a numeric format based on a number system having an appropriate radix. An appropriate radix is determined according to the type of information included in the raw data. For example, for raw data generally comprised of alpha-numeric characters, an appropriate radix may be greater than or equal to the number of different alpha-numeric characters present in the raw data. Using such a number system allows raw data to be represented numerically, allowing for manipulation through various well-known mathematical operations.

Another feature of the present invention is that the number system may be selected so that the numbers themselves retain semantic significance to the raw data they represent. In other words, the numerals in the number system are selected so that they correspond to the raw data. For example, in the case of raw data comprised of alphanumeric characters, the numerals are selected to correspond to the alphanumeric characters they represent. When the numerals in the number system are subsequently displayed, they appear as the alphanumeric characters they represent.

Another feature of the present invention is that once the raw data is represented as vectors in an appropriate number system, the represented data may be efficiently manipulated in the database (e.g., sorted, etc.) using various well-known techniques. Furthermore, various well-known mathematical operations may be performed on the vectors to analyze the data content. These mathematical operations may include correlation functions, eigenvector analyses, pattern recognition methods, and others as would be apparent.

Still another feature of the present invention is that the raw data is incorporated into a distilled database. The distilled database represents the best information extracted from the raw data without having any data disorder.

Yet another feature of the present invention is that new data may be compared to the distilled database to determine whether the new data actually includes any new information or content not already present in the distilled database. Any new information not already in the distilled database is added to the distilled database without adding any disorder. In this manner, the integrity of the distilled database may be maintained.

Other features and advantages of the invention will become apparent from the following drawings and description.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

FIG. 1 illustrates a processing system in which the present invention may be implemented.

FIG. 2 illustrates stages of data processed by one embodiment of the present invention.

FIG. 3 is a flow diagram for converting raw data from its original format into a numeric format in accordance with one embodiment of the present invention.

FIG. 4 illustrates a data record suitable for use with the present invention.

FIG. 5 illustrates raw data tables suitable for use with the present invention.

FIG. 6 illustrates reference data tables, representing data formatted in accordance with an embodiment of the present invention.

FIG. 7 is a flow diagram for analyzing reference data in accordance with an embodiment of the present invention.

FIG. 8 illustrates a distilled data table, representing related data correlated in accordance with an embodiment of the present invention.

FIG. 9 illustrates an example of data clustering in a two-dimensional space.

FIG. 10 is a flow diagram for identifying duplicate data among a pair of field vectors.

FIG. 11 is a flow diagram for identifying duplicate data among a pair of field vectors in further detail.

FIG. 12 illustrates an example of identifying duplicate data among a pair of field vectors.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is directed to a system and method for organizing data in a database system. The present invention is described below with respect to various exemplary embodiments, particularly with respect to various database applications. However, various features of the present invention may be extended to other areas as would be apparent. In general, the present invention may be applicable to many database applications where large amounts of seemingly unrelated data must be compiled, stored, manipulated, and/or analyzed to determine the various relationships present in the content represented by the data. More particularly, the present invention provides a method for achieving and maintaining the integrity (i.e., accuracy and correctness) of data in a database system, even when that data initially possesses a high degree of disorder. As used herein, disorder refers to data that is duplicative, erroneous, incomplete, imprecise, false or otherwise incorrect or redundant. Disorder may present itself in the database system in many ways as would be apparent.

One embodiment of the present invention is used to maintain a database associated with accounts receivable. In this embodiment, a company may collect data relating to various persons, businesses and/or accounts from one or more sources. These sources may include, for example, credit card companies, financial institutions, banks, retail, and wholesale businesses and other such sources. While each of these sources may provide data relating to various accounts, each source may provide data representing different information based on its own needs. Furthermore, this data may be organized in entirely different ways. For example, a wholesale distributor may have data corresponding to accounts receivable corresponding to business accounts. Such data may be organized by account numbers, with each data record having data fields identifying an account number, a business associated with that account number, an address of that business, and an amount owed on the account. A retail company may have data records representing similar information but based on accounts corresponding to individuals as well as businesses.

In other embodiments of the present invention, other types of sources may provide different types of data. For example, scientific institutions may provide scientific data with respect to various areas of research. Industrial companies may provide industrial data with respect to raw materials, manufacturing, production, and/or supply. Courts or other types of legal institutions may provide legal data with respect to legal status, judgments, bankruptcy, and/or liens. As would be apparent, the present invention may use data from a wide variety of sources.

In another embodiment of the present invention, a database may be maintained to implement an integrated billing and order control system. In addition to billing-type information from sources similar to those described above, this embodiment may include data records corresponding to inventory, data records corresponding to suppliers of the inventory, and data records corresponding to purchasers of the inventory. Inventory data may be organized by part numbers, with each data record having data fields identifying an internal part number, an external part number (i.e., supplier part number), a quantity on hand, a quantity expected to ship, a quantity expected to be received, a wholesale price, and a retail price. Supplier data may be organized by a supplier number, and customer data may be organized by a customer number. Data records corresponding to each of these records may include data fields identifying a part number, a part price, a quantity ordered, a ship date, and other such information.

Another embodiment of the present invention may include an enterprise storage system that consolidates corporate information from multiple, dissimilar sources and makes that information available to users on the corporate network regardless of the type of the data, the type of computer that generated the data, or the type of computer that requested the data. Still another embodiment of the present invention includes a business intelligence system that warehouses and markets information and allows that information to be processed and analyzed on-line.

The present invention enables raw data collected from different sources to be analyzed and distilled into a collection of accurate data, organized in a way that is useful for a particular application. Using the above example of an integrated billing and order control system, explained more fully below, the present invention may produce a distilled database in which related data, such as data relating to a particular supplier or customer, may be identified as such. In this example, duplicate data corresponding to the same supplier or customer may be identified and/or discarded, and erroneous data associated with the supplier or customer may be identified, analyzed, and possibly corrected.

In general, the present invention may be implemented in hardware or software, or a combination of both. Preferably, the present invention is implemented as a software program executing in a programmable processing system including a processor, a data storage system, and input and output devices. An example of such a system 100 is illustrated in FIG. 1. System 100 may include a processor 110, a memory 120, a storage device 130, and an I/O controller 140, coupled to one another by a processor bus 150. I/O controller 140 is also coupled via an I/O bus 160 to various input and output devices, such as a keyboard 170, a mouse 180, and a display 190. Other components may be included in the system 100 as would be apparent.

FIG. 2 illustrates various forms of data processed by the present invention. Raw data 210 may be collected from one or more sources, such as raw data 210A and raw data 210B. As used herein, “raw data” simply refers to data as it is received from a particular source. Additional sources of raw data 210 may be included as would be apparent. As explained below, raw data 210 from various sources is converted into a numerical format and stored in a reference database 220. Using a process referred to herein as “data dialysis,” the present invention “purifies” raw data 210 to form reference data in reference database 220. Reference database 220 includes all the information found in raw data 210 including duplicate, incomplete, inconsistent, and erroneous data.

Distilled data stored in a distilled database 230 is derived from the reference data of reference database 220. Distilled data represents the “accurate” data available from raw data 210. Distilled database 230 includes the unique information found in raw data 210. Distilled data thus represents the best information available from raw data 210.

As also explained below, the present invention further provides for using distilled database 230 to analyze and verify new data 240, which may also be used to update the reference database 220 and distilled database 230 as appropriate.

While the present invention has numerous embodiments, to clarify its description, a preferred embodiment is explained with reference to FIGS. 3-8 in a context of an integrated billing and order control system. In this embodiment, raw data 210 is a collection of data collected from various sources, such as order processing, shipping, receiving, accounts payable and accounts receivable, etc. This raw data 210 may include data records that are related but have different data fields, duplicate data records, data records having one or more erroneous data fields, etc. To address such errors, the present invention converts raw data 210 from their original formats and data structures (which may vary based on the source) into a numeric format and stores this reference data in reference database 220.

According to the present invention, the reference data is then compared and analyzed to distill the best information available. In one embodiment of the present invention, this best information may be stored as distilled data in distilled database 230. This process is now described.

Collecting Raw Data

FIG. 3 illustrates the process by which raw data 210 is converted into reference data in reference database 220 according to one embodiment of the present invention. In a step 310, raw data 210 is collected from a raw data source. As illustrated in FIG. 2, raw data 210 may include data from one or more sources such as raw data 210A and raw data 210B. As used herein, “data” refers to the physical digital representation of information, and data “content” refers to the meaning of, or information included in or represented by that data. The different records in raw data 210 may include similar types of data content. For example, in a billing context, different records in raw data 210 may all include data content relating to a particular account.

Raw data 210 will typically be received in the form of data records 400, as illustrated in FIG. 4. Each data record 400 generally includes related information, such as information for a specific individual, company, or account. Each data record 400 stores this information in one or more data fields 410. Examples of possible data fields 410 include an account number, a last name, a first name, a company name, an account balance, etc. Each data field 410, in turn, may include one or more data elements 420 for representing information for that specific record and specific field. Data elements 420 may exist in various formats, such as alphanumeric, numeric, ASCII, and EBCDIC, or other representation as would be apparent. Raw data 210 collected from different sources may be formatted differently. Data records 400 may include different data fields 410, and the information included in data fields 410 may be represented using data elements 420 in different formats, as would also be apparent.

Examples of raw data 210 are illustrated in raw data tables 510, 520, and 530 of FIG. 5. Data records, such as data record 510-1 and data record 510-2, are illustrated as rows of raw data tables 510, 520, and 530, whereas data fields, such as data field 510-A and data field 510-B, are illustrated as columns of raw data tables 510, 520, and 530. The tables illustrated in FIG. 5 are examples of data that might be found in various embodiments of the present invention. In other embodiments, data may come from many sources and may be formatted as databases having a much larger number of data records and/or data fields, as would be apparent.

Conversion to Numeric Format

Referring to FIG. 3, in a step 320, the present invention converts raw data 210 from its original representation (which may be in alphanumeric, numeric, ASCII, EBCDIC, or other similar formats) to a numeric representation. This ensures that reference data is represented in the same manner. Thus, the reference data, including that data from different sources, may be similarly processed.

According to the present invention, raw data 210 is converted from its original representation into an appropriate numeric representation. An appropriate numeric representation uses a number system in which each possible value of data element 420 may be represented by a unique digit or value in the number system. In other words, a radix for the number system is selected such that the radix is at least as great as the number of possible values for a particular data element. For example, in a biotechnology application for detecting nucleotide sequences of Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in nucleic acids, each data element may be one of only four values: A, G, C, and T. In such an application, a radix of four for the number system may be sufficient to represent each data element as a unique number. One such number system may include the numbers A, G, C, and T. In some embodiments of the present invention, it may be desirable to use a radix at least one greater than the number of different possible values of data element 420 in order to provide a number representative of an empty field. In this case, such a number system may include the numbers A, G, C, and T, together with a fifth number, distinct from those four, that serves as the empty field value.
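The radix selection described above may be sketched as follows. This is a minimal illustration only, not the claimed implementation; the function name `build_number_system` and the underscore used as the empty field symbol are assumptions made for this example.

```python
def build_number_system(symbols, empty_symbol=None):
    """Return (radix, digit_of) for the distinct values a data element may take.

    `empty_symbol` is a hypothetical extra numeral standing for an empty field;
    when given, the radix is one greater than the number of data values.
    """
    ordered = list(dict.fromkeys(symbols))  # drop repeats, keep first-seen order
    if empty_symbol is not None:
        if empty_symbol in ordered:
            raise ValueError("empty_symbol must differ from every data value")
        ordered.append(empty_symbol)
    digit_of = {symbol: value for value, symbol in enumerate(ordered)}
    return len(ordered), digit_of


# Nucleotide example from the text: four possible values give a radix of four.
radix, digit_of = build_number_system("AGCT")

# Same alphabet with one additional numeral reserved for the empty field.
radix_with_empty, digits_with_empty = build_number_system("AGCT", empty_symbol="_")
```

The only constraint illustrated is that the radix is at least the number of distinct data element values, plus one when an empty field value is wanted.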

According to a preferred embodiment of the present invention, data elements 420 in raw data 210 are comprised of characters such as alphanumeric characters. In this preferred embodiment, a radix of 40 is selected to represent the alphanumeric characters as illustrated in the table below. (Note that a minimum radix of 36 is required.) This radix is selected to accommodate the ten numeric characters “0”-“9” and the twenty-six alphabetic characters “A” to “Z” as well as to allow for several additional characters. In this embodiment, uppercase and lowercase characters are not distinguished from one another.

As illustrated in Table 1, the base-40 number system includes the numbers 0-9, followed by A-Z, further followed by four additional numbers. One of these numbers may be used to represent an empty field. This number is used to represent a data field 410 that is empty or has no value (in contrast to a zero value). Other numbers may be used, for example, to represent other types of information such as spaces or used as control information.

TABLE 1

Alpha-Numeric   Base-10 Number   Base-40 Number
0               0                0
1               1                1
2               2                2
3               3                3
4               4                4
5               5                5
6               6                6
7               7                7
8               8                8
9               9                9
A or a          10               A
B or b          11               B
C or c          12               C
D or d          13               D
E or e          14               E
F or f          15               F
G or g          16               G
H or h          17               H
I or i          18               I
J or j          19               J
K or k          20               K
L or l          21               L
M or m          22               M
N or n          23               N
O or o          24               O
P or p          25               P
Q or q          26               Q
R or r          27               R
S or s          28               S
T or t          29               T
U or u          30               U
V or v          31               V
W or w          32               W
X or x          33               X
Y or y          34               Y
Z or z          35               Z
(none)          36               [
(none)          37               \
(none)          38               ]
(none)          39               ^

Representation of raw data 210 in a base-40 format has numerous benefits. One benefit is that raw data 210 may be represented in a numeric fashion, facilitating straightforward mathematical manipulation. Another benefit is that proper selection of both the radix and the numerals in the number system allows the represented content to maintain semantic significance, facilitating recognition of the content of raw data 210 in its representation in the numeric format. For example, the word “JOHN” represented by the four alphanumeric characters “J” “O” “H” “N” may be represented in various number systems. One such number system is a base-40 number system. Using Table 1, representing the alphanumeric characters “JOHN” as a base-40 number would result in the “tetradecimal” value ‘JOHN’, which is equivalent to the decimal value 1,255,103 (19*40³+24*40²+17*40¹+23*40⁰, where base-40 ‘J’ equals decimal 19, etc.). Note that the base-10 number loses semantic significance from the content of raw data 210 whereas the base-40 number retains semantic significance, as the number ‘JOHN’ is recognizable as the content “JOHN.” Semantic significance provides the benefits of a numeric representation while maintaining the ability to convey semantic content.
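The “JOHN” example may be reproduced with a short sketch. The numeral ordering follows Table 1, with the last of the four additional numerals assumed here to be a circumflex; the names `DIGITS`, `to_int`, and `to_base40` are illustrative and do not come from the specification.

```python
# Numerals of the base-40 system in Table 1 order: 0-9, A-Z, then four extras.
# The final symbol is assumed to be "^".
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^"
RADIX = len(DIGITS)


def to_int(text):
    """Interpret text as a base-40 number; letter case is not distinguished."""
    value = 0
    for ch in text.upper():
        value = value * RADIX + DIGITS.index(ch)  # ValueError on unknown symbols
    return value


def to_base40(value):
    """Render a non-negative integer using the base-40 numerals."""
    if value == 0:
        return DIGITS[0]
    numerals = []
    while value > 0:
        value, remainder = divmod(value, RADIX)
        numerals.append(DIGITS[remainder])
    return "".join(reversed(numerals))


john = to_int("JOHN")        # 19*40**3 + 24*40**2 + 17*40 + 23
round_trip = to_base40(john)
```

Displaying the stored integer with `to_base40` yields the original characters again, which is the semantic significance discussed above. One limitation of this sketch is that leading ‘0’ numerals are not preserved on the round trip.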

In some embodiments of the present invention, the selection of a radix and its corresponding number system may depend upon the number of bits used by processor 110. The number of bits used by processor 110 and the radix chosen for the number system define the number of characters that can be represented by a data word in processor 110. This relationship is governed according to the following equation:

N=B*ln(2)/ln(R),

where N is the number of whole characters (i.e., fractional characters are discarded) represented by a data word of processor 110, B is the number of bits per data word, and R is the selected radix. This relationship limits the number of data elements 420 of raw data 210 that may fit in a data word. For example, in a 32-bit machine, the maximum number of characters that may fit in a data word using a base-40 number system is six (32*ln(2)/ln(40)=6.013). The maximum number of characters that may fit in a data word using a base-41 number system is only five (32*ln(2)/ln(41)=5.973). Thus, in some embodiments of the present invention, in addition to having a radix sufficiently large to maintain semantic significance, the radix may also be selected to maximize the number of characters represented by a single data word. In the embodiment with raw data comprised of alphanumeric characters, an appropriate radix may range from 36 to 40. This range maintains semantic significance while maximizing the number of characters represented by the 32-bit data word. Other types of raw data and other sizes of data word may dictate other appropriate radix ranges in other embodiments of the present invention.
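The relationship between word size, radix, and characters per data word may be checked with the sketch below. It is offered only as an illustration; the function names are assumptions. The first function evaluates the logarithmic formula directly, and the second finds the same quantity with exact integer arithmetic, which avoids floating-point rounding when the quotient lies very close to a whole number.

```python
import math


def chars_per_word_log(bits, radix):
    """N = floor(B * ln(2) / ln(R)), evaluated in floating point."""
    return math.floor(bits * math.log(2) / math.log(radix))


def chars_per_word(bits, radix):
    """Largest N such that radix**N <= 2**bits, using integers only."""
    limit = 1 << bits
    count = 0
    while radix ** (count + 1) <= limit:
        count += 1
    return count


base40_in_32_bits = chars_per_word(32, 40)
base41_in_32_bits = chars_per_word(32, 41)
base36_in_32_bits = chars_per_word(32, 36)
```

Under these definitions every radix from 36 through 40 yields the same count for a 32-bit word, while a radix of 41 yields one fewer, consistent with the range stated above.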

The embodiment of the present invention described above does not distinguish between uppercase and lowercase characters. However, other embodiments of the present invention may distinguish between these types of characters. Accordingly, a base-64 representation (“0”-“9”, “A”-“Z”, “a”-“z”, and two other values) may be appropriate to distinguish between these characters as would be apparent.

The number of data elements 420 in each data field 410 also dictates the precision required by the number as represented in processor 110. As described above, each data field 410 may only be six characters or data elements 420 wide for single precision operations in a 32-bit machine. In some embodiments of the present invention, this may be insufficient. In these embodiments, double, triple, or even quadruple precision may be required to represent the entire data field 410 as a single value. Double precision numbers are sufficient for up to twelve character data fields 410; triple precision numbers are sufficient for up to eighteen characters; and quadruple precision numbers are sufficient for up to twenty-four characters.
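The precision needed for a given field width follows from the six-characters-per-word figure. A minimal sketch, assuming a 32-bit data word and a base-40 number system, with the hypothetical helper name `words_needed`:

```python
def words_needed(field_width, chars_per_word=6):
    """Number of data words (1 = single precision, 2 = double, ...) for a field."""
    if field_width < 0:
        raise ValueError("field_width must be non-negative")
    # Ceiling division; an empty field still occupies one word.
    return max(1, -(-field_width // chars_per_word))


precision_by_width = {width: words_needed(width) for width in (6, 12, 18, 24)}
```

The default of six characters per word is specific to the 32-bit, base-40 case; other word sizes or radices would change that argument.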

Alternate embodiments of the present invention may accommodate large data fields by breaking a large data field into one or more smaller data fields. The large data fields may be broken at boundaries defined by spaces. For example, a data field representing an address such as “123 West Main Street” may be broken into four smaller data fields: ‘123’, ‘West’, ‘Main’, and ‘Street’. The large data fields may also be broken at data word boundaries. In the address example above, the smaller data fields might be: ‘123\We’, ‘st\Mai’, ‘n\Stre’, and ‘et’, where the number ‘\’ is used to represent a space. Other embodiments of the present invention may accommodate large data fields in other manners as would be apparent.
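The two ways of breaking a large data field may be sketched as follows. The function names are illustrative, the chunk width of six assumes the 32-bit, base-40 case, and the backslash is taken to be the numeral that represents a space, as in the example above.

```python
SPACE = "\\"  # numeral assumed to stand for a space


def split_at_spaces(field):
    """Break a field into smaller fields at boundaries defined by spaces."""
    return field.split()


def split_at_word_boundaries(field, width=6):
    """Break a field into fixed-width chunks, one chunk per data word."""
    packed = SPACE.join(field.split())  # replace each run of spaces with the numeral
    return [packed[i:i + width] for i in range(0, len(packed), width)]


by_spaces = split_at_spaces("123 West Main Street")
by_words = split_at_word_boundaries("123 West Main Street")
```

In this sketch the chunks are computed mechanically from the width, so every chunk except possibly the last holds exactly six numerals, including any space numerals that fall inside it.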

Data Structure Conversion

As illustrated in FIG. 3, in a step 330, raw data 210 represented as a number is stored in a predefined data structure. In one embodiment of the present invention, this data structure is a single-field table as illustrated by Tables 610-670 of FIG. 6. This data structure may vary. For example, in other embodiments of the present invention, the data structure may be a multiple-field table instead of a single-field table. In these embodiments, the data structures may be implemented with standard features such as table headers and indices, and as explained in greater detail below, may also include probability values for each record. These probability values represent the likelihood that the data in that record is complete. Higher probability values may indicate a higher probability of completeness, and lower probability values similarly may indicate a lower probability of completeness. This is described in further detail below. Initially, the probability values are set to 0. Other embodiments may also include key numbers or identification numbers to aid in sorting and in maintaining relationships among the data records.

In a preferred embodiment of the present invention, raw data 210 illustrated in FIG. 5 includes three tables 510, 520, and 530. Table 510 may represent raw data 210 from, for example, a company's accounts receivable system. Columns of table 510 represent data fields for an account number, a last name, a first initial, and additional fields for listing various orders processed for a particular individual. Rows of table 510 (such as 510-1 and 510-2) represent data records for different individuals. Tables 520 and 530 may represent raw data 210 maintained by credit card companies. Columns of tables 520 and 530 represent data fields for an account number, a last name, a first name, and an address. Rows of tables 520 and 530 represent data records for specific accounts.

In the preferred embodiment, step 330 converts raw data 210 from the format illustrated in FIG. 5 into a format illustrated in FIG. 6. FIG. 6 illustrates raw data 210, combined from the various raw data tables 510, 520, 530 of FIG. 5, represented as numbers in a base-40 number system, and formatted as new tables (tables 610-670), which together may comprise reference database 220.

Each reference database table 610-670 corresponds to an individual field from raw data tables 510, 520, and 530 of FIG. 5. More specifically, data records of reference data tables 610-670 correspond to the data records of raw data table 510, followed by the data records of raw data table 520, followed by the data records of raw data table 530. In one embodiment of the present invention, where a raw data table record has no information for a particular data field 410 represented in a reference table 610-670, an empty field value is entered in that field in the reference table. For example, the first data record 510-1 of Table 510 has no information about an address, and thus an empty field value is placed in the first position of table 670.

Data is preferably stored in reference database 220 in such a way thatall data corresponding to a single data record in a raw data table isreadily identified. In the embodiment represented in FIGS. 5 and 6, forexample, data corresponding to any specific data record of the raw datatables (tables 510, 520, 530) is preferably represented in referencetables 610-670 as a “vector” of numeric data stored at an index i acrossreference tables 610-670. For example, data corresponding to the sixthrecord 520-6 of raw data table 520 (illustrated as account number “A60”belonging to “Jennifer Brown,” residing at “51 Fourth Street”) isrepresented in reference database tables 610-670 as a vector havingcoefficients formed from the tenth records 610-10, 620-10, 630-10,640-10, 650-10, 660-10, and 670-10 of the tables 610-670.

As illustrated in FIG. 6, reference database 220 includes a new table610 that does not correspond to any data field 410 in raw data 210illustrated in FIG. 5. This table is a “key table” that identifies therelated data in these data vectors. As described below, referencedatabase 220 comprised of the tables illustrated in FIG. 6 may includeadditional key tables for data fields. These may include a personalidentification number (“PIDN”), an account identification number(“AIDN”), or other types of identification numbers. These key tables oridentification numbers may be used to identify sets of related datavectors in reference database 220.

In this example, key table 610 has a single field “PIDN,” which standsfor personal identification number. Key table 610 provides a uniqueidentifier such that a specific PIDN number never refers to more thanone person represented in raw data 210. In other words, the PIDN numberreflects the fact that many multiple records in raw data 210 may referto the same person.

Preferably, each data record in the key table 610 initially corresponds to a different data record represented in the raw data tables 510, 520, and 530. For example, in FIG. 6, data record 610-10 in the key table 610 is implemented such that it includes identifiers (such as pointers or indices) for corresponding data in reference tables 620-670, which together correspond to a single record 520-6 in raw data table 520.

Initially, while a single PIDN does not refer to multiple individuals, a single individual may correspond to multiple PIDNs. For example, in FIG. 6, vector 4 (defined by PIDN 4) and vector 9 (defined by PIDN 9) appear to refer to the same person, but as illustrated, this person is initially assigned two PIDNs: PIDN 4 and PIDN 9. As described below, the present invention enables a determination whether PIDN 4 and PIDN 9 do, in fact, refer to the same individual, and if so, assigns a single PIDN to this individual. Alternatively, some embodiments may assign a new PIDN to individuals so determined, and a reference to the old PIDN may be retained.

As discussed above, in this embodiment, records are represented in the reference database tables 610-670 as vectors having coefficients of base-40 numbers across eight one-field tables. This numeric representation allows the data to be analyzed using straightforward mathematical operations that may be used to, for example, produce correlations, calculate eigenvectors, perform various coordinate transformations, and utilize various pattern recognition analyses. These operations may, in turn, be used to provide or derive information about the records and their relationships to one another. By using small, one-field tables, these operations may be performed quickly. In addition, as will be illustrated, representation in base-40 numbers with raw data 210 including alphanumeric characters allows content of raw data 210 to retain its semantic significance.

Data Dialysis

Referring back to FIG. 2, once reference database 220 is created as illustrated in FIG. 6, a data dialysis process 700 is applied to distill the most accurate data for inclusion in distilled database 230. Data dialysis 700 is now described with reference to FIG. 7.

Partitioning the Reference Data

In a step 710, reference database 220 is preferably partitioned or sorted into sets based on some criteria. These sorting criteria may vary. For example, as illustrated in table 810 of FIG. 8, in this embodiment, data records may be sorted into sets based on last name, with the values arranged in increasing numeric order (recall that content of raw data is now represented as base-40 numbers in reference database 220). Table 810 is derived from reference database table 620 illustrated in FIG. 6, with each entry of table 810 defined by a unique last name and having a corresponding set of table 620 records matching that last name. In the representation illustrated, table 810 includes a field for defining the set (in this case, a last name), as well as identifiers for members of the set (such as indices, pointers, or other appropriate references; in this case, PIDNs).

In some embodiments of the present invention, not all vectors in reference database 220 will have data for the field on which the sets are based. Such vectors may be handled in various manners. For example, all vectors in reference database 220 having no data for that data field may be regarded as members of a single, additional set. Alternatively, each vector in reference database 220 having no data for that data field may be regarded as the single member of its own set.
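By way of illustration only, the partitioning of step 710 and the two ways of handling vectors with no data for the partitioning field might be sketched as follows. The function name `partition_by_field`, the `missing` parameter, the use of `None` as the empty field value, and the record with PIDN 14 are assumptions made for this sketch. Plain strings sorted alphabetically stand in for the base-40 numbers sorted numerically.

```python
from collections import defaultdict

EMPTY = None  # assumed stand-in for the empty field value


def partition_by_field(field_values, missing="own_set"):
    """Group PIDNs into sets sharing a field value, ordered by that value.

    field_values maps PIDN -> field value (or EMPTY).
    missing="single_set": all vectors lacking the field form one extra set.
    missing="own_set": each vector lacking the field forms its own set.
    """
    if missing not in ("single_set", "own_set"):
        raise ValueError("missing must be 'single_set' or 'own_set'")
    sets = defaultdict(list)
    no_data = []
    for pidn, value in field_values.items():
        if value is EMPTY:
            no_data.append(pidn)
        else:
            sets[value].append(pidn)
    # One entry per distinct value, in increasing order of the value.
    ordered = [(value, sorted(sets[value])) for value in sorted(sets)]
    if missing == "single_set":
        if no_data:
            ordered.append((EMPTY, sorted(no_data)))
    else:
        ordered.extend((EMPTY, [pidn]) for pidn in sorted(no_data))
    return ordered


# Last names by PIDN as in the example of FIGS. 6 and 8; PIDN 14 is a
# hypothetical record with no last name.
last_names = {
    1: "ZANE", 2: "SMITH", 3: "LEE", 4: "SMITH", 5: "ZANE", 6: "LEE",
    7: "LEE", 8: "SMITH", 9: "SMITH", 10: "BROWN", 11: "SMITH",
    12: "ZANE", 13: "ZANE", 14: EMPTY,
}
table_810 = partition_by_field(last_names)
```

In this sketch the entry for “SMITH” lists PIDNs 2, 4, 8, 9, and 11, and the hypothetical PIDN 14 becomes the single member of its own set.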

Identifying Duplicate Data

Returning to FIG. 7, in a step 720, those data records within the partitioned sets identified as duplicates are marked. In some embodiments of the present invention, duplicate data may be unnecessary and may be discarded. In other embodiments, all information remains in reference database 220, as all information, even erroneous, incomplete, or duplicate information, may be better than no information and may be useful for some purpose, such as identifying fraud.

In some embodiments of the present invention, comparing a pair of vectors may identify duplicates. Various operations may be used, as would be apparent. In a simple example, a straightforward vector subtraction may be performed to measure the degree of similarity between two records. Other techniques may be used to identify duplicate vectors, such as using “look-up” tables to identify common names, nicknames, abbreviations, etc.

Table 810 of FIG. 8 illustrates that the last name “Smith” corresponds to PIDNs 2, 4, 8, 9, and 11, representing vectors formed from entries 2, 4, 8, 9, and 11 of the reference database tables 610-670 illustrated in FIG. 6:

-   For PIDN 2: [SMITH, J, 98-002, A40, A60, ^]
-   For PIDN 4: [SMITH, J, 98-004, A50, B10, ^]
-   For PIDN 8: [SMITH, Jennifer, ^, A40, ^, 300 Pine St.]
-   For PIDN 9: [SMITH, John, ^, A50, ^, 37 Hunt Dr.]
-   For PIDN 11: [SMITH, Jhon, ^, B10, ^, 85 Belmont Ave.]

Vector (or matrix) operations comparing the vectors and thresholds for determining when two entries are similar enough to be regarded as duplicates may be defined as appropriate for various embodiments. In a simple example, the sum of the absolute differences between corresponding coefficients of a pair of vectors may indicate a similarity between the corresponding pair of records. This pair of vectors may be considered duplicates if a first vector is not inconsistent with any field of a second vector, and does not provide any additional data. In this embodiment, additional rules would also be defined, for example, for comparing entries of different lengths (e.g., right aligning character strings corresponding to numbers, and left aligning character strings corresponding to letters), for recognizing commonly misspelled words or spelling variations of words, and for recognizing transposed letters in words. This processing may be performed by various mechanisms, as would be apparent. In the example of Table 810 of FIG. 8, none of the data records are exact duplicates, and so none are marked in step 720.
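As a minimal sketch of the duplicate rule stated above, the following assumes exact comparison of field values and uses `None` for the empty field value. The function name `is_duplicate` and the vector named `hypothetical` are assumptions made for illustration. The additional rules mentioned above (alignment, misspellings, transposed letters) are deliberately omitted.

```python
EMPTY = None  # assumed stand-in for the empty field value


def is_duplicate(first, second):
    """Return True if `first` adds nothing to, and never contradicts, `second`."""
    for a, b in zip(first, second):
        if a is EMPTY:
            continue          # first asserts nothing for this field
        if b is EMPTY:
            return False      # first provides additional data
        if a != b:
            return False      # first is inconsistent with second
    return True


# The five "Smith" vectors from table 810.
smiths = {
    2: ["SMITH", "J", "98-002", "A40", "A60", EMPTY],
    4: ["SMITH", "J", "98-004", "A50", "B10", EMPTY],
    8: ["SMITH", "Jennifer", EMPTY, "A40", EMPTY, "300 Pine St."],
    9: ["SMITH", "John", EMPTY, "A50", EMPTY, "37 Hunt Dr."],
    11: ["SMITH", "Jhon", EMPTY, "B10", EMPTY, "85 Belmont Ave."],
}

# Ordered pairs (p, q) where vector p is a duplicate of vector q.
duplicate_pairs = [
    (p, q) for p in smiths for q in smiths
    if p != q and is_duplicate(smiths[p], smiths[q])
]

# A hypothetical vector carrying only a subset of the data of PIDN 2.
hypothetical = ["SMITH", "J", EMPTY, "A40", EMPTY, EMPTY]
```

Under exact comparison, `duplicate_pairs` is empty for the five vectors listed, which agrees with the observation that none of the records in Table 810 are exact duplicates. The hypothetical vector, by contrast, would be marked as a duplicate of the vector for PIDN 2, but not the reverse.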

Correlating Data

Referring back to FIG. 7, in a step 730, the preferred embodiment of the present invention correlates data records remaining within each set and, in a step 740, further partitions the data records into independent subsets of data records. In general, the “correlation” between two vectors is a measurement of how closely one is related to the other, and specific methods of correlation may vary depending on the intended application. A general discussion and examples of correlation functions may be found in references such as NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (Cambridge University Press, 2nd ed. 1992) by William H. Press, et al. Other techniques and examples may be found in THE ART OF COMPUTER PROGRAMMING (Addison-Wesley Pub., 1998) by Donald E. Knuth.

As an example, a simple measurement of the correlation between vectors is their dot product, which may be weighted as appropriate. Depending on the application, the dot product may be calculated on only a subset of the vector coefficients, or may be defined to compare not only corresponding coefficients, but also other pairs of coefficients determined to be in related fields (e.g., comparing a “first name” coefficient of a first vector with a “middle name” coefficient of a second vector). As with the operations for identifying duplicate data, the correlation function may be appropriately tailored for its intended application. For example, a correlation function may be defined to appropriately compare entries of different lengths and to appropriately distinguish between significant and insignificant differences, as would be apparent.

In the embodiment explained with reference to the tables of FIGS. 5, 6, and 8, an example of a correlation function compares vectors corresponding to the members of a set sharing the same last name to identify independent subsets of vectors. Again, this determination may be based on application-specific criteria. In this example, independent vectors may be defined to be those vectors representing different individuals.

As a result of applying the correlation function, a correlation parameter reflecting the degree of independence of a pair of vectors is assigned. For example, a high value may be assigned to indicate a high degree of similarity, and a low value may be assigned to indicate a limited degree of similarity. The correlation value is then compared to a predetermined threshold value (which, again, may vary in different applications) to determine whether the two records corresponding to those vectors are considered to be independent.
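One possible form of such a correlation function is sketched below as a weighted dot product compared against a threshold. The normalization by vector length, the function names, and the sample weights are assumptions of this sketch and are not prescribed by the description above. A zero weight removes a coefficient from the calculation, which corresponds to computing the dot product on only a subset of the coefficients.

```python
import math


def weighted_dot(u, v, weights):
    """Dot product of u and v with a per-coefficient weight."""
    return sum(w * a * b for w, a, b in zip(weights, u, v))


def correlation(u, v, weights):
    """Weighted dot product scaled by the weighted lengths of u and v.

    The scaling (an assumption of this sketch) keeps the value at or below
    1.0 for non-negative weights, so one threshold suits vectors of any
    magnitude.
    """
    norm = math.sqrt(weighted_dot(u, u, weights) * weighted_dot(v, v, weights))
    return weighted_dot(u, v, weights) / norm if norm else 0.0


def are_independent(u, v, weights, threshold):
    """Records are treated as independent when correlation is below threshold."""
    return correlation(u, v, weights) < threshold
```

For instance, with unit weights, `correlation([1, 0, 1], [1, 0, 1], [1, 1, 1])` evaluates to 1.0, while two vectors with no coefficient in common yield 0.0 and would be considered independent for any positive threshold.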

Based on the correlation values, in a step 740, the preferred embodiment partitions the data records into subsets of independent data records within each set. In the examples of FIGS. 5 and 6 and Table 810 of FIG. 8, members of an independent subset may be identified as those members having: the same last name (taking into consideration misspellings and spelling variations); relatively similar first names (taking into consideration misspellings, spelling variations, nicknames, and combinations of first and middle names and initials); one or more matching account numbers; and no more than three addresses (to allow for work and home addresses, and one change of address).

Results of applying such a function are illustrated in Table 820 of FIG. 8. The individuals identified are:

-   Jennifer Brown, PIDN 10;
-   Howard Lee, PIDNs 3 and 6;
-   Carole Lee, PIDN 7;
-   Jennifer Smith, PIDNs 2 and 8;
-   John Smith, PIDNs 4 and 11;
-   John Smith, PIDN 9;
-   Ann Zane, PIDNs 1, 5, and 12; and
-   Molly Zane, PIDN 13.

Other operations for correlating the vectors are available. These may include computing dot products, cross products, lengths, direction vectors, and a plethora of other functions and algorithms used for evaluation according to well-known techniques.

FIG. 9 illustrates a two-dimensional example of a concept referred to as clustering, which is used conceptually to describe some general aspects of the present invention. In FIG. 9, four clusters exist as a collection of two-dimensional points. These clusters are identified as: (a,b), (c,d), (e,f), and (g,h). As illustrated, each cluster is formed from one or more points in the two-dimensional space. Each point corresponds to a data record that represents (with more or less accuracy) the “true” value of the cluster in the space. As illustrated, clusters (a,b) and (c,d) are fairly easy to distinguish from one another and from clusters (e,f) and (g,h). However, in this simple example, clusters (e,f) and (g,h) are not easily distinguished from one another. Extending the space (i.e., adding additional data fields to the vectors) may increase the separation between clusters such as (e,f) and (g,h) so that they become more readily distinguished from one another. Alternatively, extending the space may indicate that (g,h) is a point that belongs to cluster (e,f) or even cluster (c,d). In the abstract, the space may be extended infinitely, resulting in a Hilbert space, which has various well-known characteristics. These characteristics may be exploited by the present invention for large, albeit not infinite, vectors as would be apparent.

Furthermore, while adding additional data fields to the vectors (i.e., extending the space) may separate clusters from one another to aid in their correlation, deleting data fields from the vectors (i.e., reducing the space) may also identify some correlations. In some embodiments of the present invention, reducing the space may identify certain clusters that are in fact representing the same individual or other unique entity. For example, one record in a database may have ten data fields exactly identical to the same ten data fields in a second record in the database. These data fields may correspond to a first name, a birth date, an address, a mother's maiden name, etc. However, these two records may have two fields that are different. These two fields may correspond to a last name and a social security number. In some cases, these records may correspond to the same individual. The present invention simplifies the process for identifying these types of records that would be difficult, if not impossible, to detect using conventional methods.

Thus, removing one or more particular data fields from a vector and reducing the corresponding space may reveal clusters that otherwise would not be apparent. Doing this for data fields traditionally used for identification purposes (e.g., last name, social security number, etc.) may reveal duplicate records in databases. This may be particularly useful for identifying fraud. Removing data fields where a vector includes an empty field value for that data field may also reveal clusters that would not otherwise be apparent.

Furthermore, once the clusters are identified as representing the same individual or entity, the best information for the individual or entity may be extracted from the information provided by each record or “black dot.”

The principles of the present invention may be extended beyond simple vectors and data fields. For example, the present invention may be extended through the use of tensors representing objects in a multi-dimensional space. In this manner, the present invention may be used to represent the parameters of various physical phenomena to gain additional insight into their operation and effect. Such application may be particularly useful for deciphering the human gene and aiding in the efforts of programs such as the Human Genome Project.

Handling Stranded Data

Referring again to FIG. 7, in a step 750, the preferred embodiment of the present invention evaluates “stranded” data records. Stranded data records are those records from reference database 220 that were not partitioned into any set in step 710. In some embodiments, reference database 220 may include a large number of tables corresponding to data fields and a large number of vectors having data for various combinations of fields. For example, in an embodiment having a reference database 220 including 20 tables for different data fields and 1000 vectors defined by related data records for each table, suppose only 800 of those 1000 vectors have data for the field “last name,” by which the sets were created in step 710. Step 710 may not partition those 200 vectors with no “last name” data into any set, or may partition each of those 200 vectors into its own set. In either case, the result is that those 200 vectors are not correlated with any others in steps 720, 730, and 740. Step 750 may evaluate those vectors.

Methods of evaluation may vary. For example, one embodiment may correlate each stranded entry with one member of each subset identified in step 740. Depending on the resulting correlation values, that vector may be added to the subset with which it is most highly correlated, or may define a new subset. Alternatively, in some embodiments, it may be determined that such evaluation is too time-consuming and step 750 may be completely skipped.
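The first method of evaluation described above may be sketched as follows. The function names, the choice of the first member of each subset as its representative, the threshold, and the simple `matching_fraction` correlation are all assumptions made for illustration. Any correlation function suited to the application could be passed in instead.

```python
EMPTY = None  # assumed stand-in for the empty field value


def matching_fraction(u, v):
    """Hypothetical correlation: share of fields that are present and equal."""
    matches = sum(1 for a, b in zip(u, v) if a is not EMPTY and a == b)
    return matches / len(u) if u else 0.0


def place_stranded(vector, subsets, correlate, threshold):
    """Add `vector` to its best-correlated subset, or start a new subset.

    Each subset is a list of vectors; only its first member is consulted.
    Returns the index of the subset that received the vector.
    """
    best_index, best_score = None, None
    for index, subset in enumerate(subsets):
        score = correlate(vector, subset[0])
        if best_score is None or score > best_score:
            best_index, best_score = index, score
    if best_score is not None and best_score >= threshold:
        subsets[best_index].append(vector)
        return best_index
    subsets.append([vector])      # no subset is close enough
    return len(subsets) - 1
```

For example, a stranded vector with no last name but with a first name and account number matching a member of an existing subset would join that subset, whereas a vector matching nothing would define a new subset of its own.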

Repeating the Correlation Process

Steps 710-750 may be repeated as needed for specific embodiments. As noted above, some embodiments will have reference database 220 having a large number of fields and a large number of entries, with many entries having data for only a subset of fields. In such a case, performing steps 710-750 on a single field is unlikely to derive all relevant information. Even in the simple example explained with reference to FIGS. 5, 6, and 8, correlating on the single field “last name” may provide only partial information about the correlation between those entries. For example, Jennifer Smith, corresponding to PIDNs 2 and 8 in FIG. 6, may be the same individual as Jennifer Brown, corresponding to PIDN 10, because PIDNs 2 and 10 may share a common account number. Performing the correlation on the last name field may not identify these PIDNs as corresponding to the same individual because they were evaluated only against other PIDNs sharing the same last name. Performing a correlation on the account number field may provide additional information about whether these PIDNs are related.

Thus, correlation across various data fields may be necessary to fully evaluate the degree of relatedness of the data in reference database 220.

Using Correlation Results to Update Reference Data

Once steps 710-760 are completed, reference database 220 has been distilled into a distilled database 230, as illustrated in FIG. 2. In some embodiments of the present invention, these two databases are handled separately and coexist with one another. In other embodiments of the present invention, a single database exists with records marked or otherwise identified as belonging to reference database 220 or distilled database 230. This may be accomplished by using different ranges of PIDNs for the records in the two databases. Furthermore, relationships between records in the two databases may be maintained by adding a constant value to the PIDN for the record in reference database 220 to generate a PIDN for the record in distilled database 230. For example, a record with a PIDN of 12345 in reference database 220 may have a PIDN of 9012345 in distilled database 230. In this manner, the two databases may be treated as distinct portions of a single database.
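The PIDN range scheme lends itself to a short sketch. The constant 9,000,000 follows from the example above (9012345 minus 12345). The function names, and the assumption that every PIDN in reference database 220 is smaller than the constant, are illustrative only.

```python
PIDN_OFFSET = 9_000_000  # constant implied by the example 12345 -> 9012345


def to_distilled_pidn(reference_pidn):
    """PIDN in distilled database 230 for a record of reference database 220."""
    return reference_pidn + PIDN_OFFSET


def to_reference_pidn(distilled_pidn):
    """Recover the reference database PIDN from a distilled database PIDN."""
    return distilled_pidn - PIDN_OFFSET


def is_distilled(pidn):
    """Assumes every reference PIDN is smaller than PIDN_OFFSET."""
    return pidn >= PIDN_OFFSET
```

Because the mapping is a fixed offset, the related record in either database can be located by arithmetic alone, without a separate cross-reference table.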

Using the Distilled Data

Once data dialysis process 700 is complete, distilled database 230 identifies subsets of data records from the reference database 220 as related records, and as noted above, probabilities may be determined for fields in the reference database 220 to provide a qualitative measure of their completeness. This may be accomplished by assigning a probability of completeness to each of the individual data fields and then using them to compute an overall probability of completeness for the data record. For example, for a data field representing a first name, a value of ‘J’ may be assigned a low probability (e.g., 0 or 0.1), a value of ‘JOHN’ may be assigned a higher probability (e.g., 0.7 or 0.8), and a value of ‘JONATHAN’ may be assigned the highest probability (e.g., 0.9 or 1.0). These values may be assigned somewhat arbitrarily. However, these values help identify which data fields in the set are most likely to include the most complete information or, in other words, the most probable data.
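A brief sketch of this scoring follows. The description does not specify how the per-field probabilities are combined into an overall probability; the product used here, like the function names and the particular probability table, is an assumption chosen for illustration.

```python
import math

# Illustrative per-value probabilities of completeness for a first-name
# field, chosen from the ranges given in the example above.
FIRST_NAME_COMPLETENESS = {"J": 0.1, "JOHN": 0.8, "JONATHAN": 1.0}


def record_completeness(field_probabilities):
    """Overall completeness of a record, assumed here to be the product
    of its per-field probabilities."""
    return math.prod(field_probabilities)


def most_complete(values, table):
    """Pick the value most likely to be complete; unknown values score 0."""
    return max(values, key=lambda value: table.get(value, 0.0))
```

Applied to a subset of related records holding the first names ‘J’, ‘JOHN’, and ‘JONATHAN’, `most_complete` would select ‘JONATHAN’ as the most probable data for that field.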

Use of the present invention may determine a significant amount of information about the records and their relationship to each other, and may be specifically tailored for particular applications. Furthermore, using standard database operations, distilled database 230 (which references records of the reference database 220) may be manipulated to provide formatted reports as needed. For example, an embodiment may be tailored to generate a report listing subsets of related records, with records of a subset providing information about a specific individual or entity. The records within such a subset may provide information, for example, about different fields of information; aliases and/or variations of names, addresses, social security numbers, etc., used by the individual; and fields (such as occupation, address, and account numbers) for which that individual may have more than one entry.

Recalling that all data is represented in numerical base-40 format, the subsets may be ordered numerically in the report. The base-40 format provides the additional advantage of representing alphabetical characters as their respective letters (as illustrated in the conversion table above). Thus, while the report will show entries in numerical representation, that representation retains the semantic significance of the data it represents, allowing the data to be manually read and analyzed. For example, if the report shows records for an individual having entries for names including J SMITH, JOHN SMITH, JOHN G SMITH, G SMITH, and GERALD SMITH, a person reading that report would understand that this individual uses various first names, including his first name or initial, his middle name or initial, or some combination thereof.

Adding New Data

As with conventional database applications, new data may be added from time to time. As illustrated in FIG. 2, the present invention accounts for adding new (or changed) data 240, which will affect reference database 220 and distilled database 230.

Generally, new data records 240 may be formatted as described with reference to FIG. 3, and entered into the existing reference database 220. Additionally, new data records 240 may be measured against distilled database 230 to determine if new information or content is available in new data record 240. For example, a new data record 240 may be correlated with data records from distilled database 230 to determine whether that new data record 240 is related to any data records already present in distilled database 230. If so, and new data record 240 contains information or content not already present in distilled database 230, new data record 240 may be used to update distilled database 230. For example, if new data record 240 included information for an individual named John Smith that corresponds to data records already present in distilled database 230 but provided the additional information that Mr. Smith's middle name was Greg, that additional information may be appropriately added to distilled database 230.

Changes to data records in reference database 220 and distilled database 230 may be handled using standard database protection operations, as described in references such as C. J. DATE, INTRODUCTION TO DATABASE SYSTEMS (Addison Wesley, 6th ed. 1994) (see specifically, Part IV), referenced above. For example, in the case that changes are made to reference database 220 by an authorized database administrator, related data records in reference database 220 are updated as determined by standard relational definitions and, where appropriate, in accordance with relations defined in distilled database 230.

Identifying Duplicate Data Between Field Vectors

One problem associated with conventional databases is a difficulty in merging records from a first database, such as raw data 210A, with those from a second database, such as raw data 210B. Records in these databases having shared or duplicate data need to be identified so that the content included therein may be merged as a single record in a database such as reference database 220 or distilled database 230. For example, both databases 210 may include one or more entries for JOHN SMITH. If the respective records in the databases 210 represent the same individual John Smith, then the content of each of the records should be merged as a single record in, for example, distilled database 230.

Conventional brute force methods for identifying such duplicate data in these databases involve comparing a data record from the first database with every data record in the second database, and repeating this process for each record in the first database. This process is time consuming and computationally intensive. In fact, the number of computations is geometrically related to the number of records in each of the two databases.

One process for reducing the time and number of computations required to identify the duplicate data in the databases 210 is described below with reference to FIGS. 10-12. In the process described below, a particular field common or similar among the databases is selected, for example a name field or an address field. This field is arranged as a table or an array for each of the databases that includes the value of the selected field for each of the records. For example, as discussed above, each table 610-670 represents a particular field of each of the data records in a database. For purposes of this discussion, these tables are referred to as field vectors.

According to the present invention, each of the field vectors is sorted in numerical order and, if necessary, partitioned into sets of identical data as described above with respect to FIGS. 7 and 8. For example, multiple records associated with JOHN SMITH would be partitioned together within the field vector. Preferably, information regarding the location of the partitions between the sets is stored.

Once the field vectors are sorted and partitioned, a value of the first element of a first field vector is compared with a value of the first element of a second field vector. Essentially, if the value in the first field vector is greater than the value in the second field vector, an index into the second field vector is advanced or otherwise adjusted to a position within the next partitioned set to obtain a next value in the second field vector. This next value in the second field vector is then compared to the value in the first field vector. This continues as long as the value in the first field vector is greater than the value in the second field vector.

On the other hand, if the value of the first field vector is less than the value of the second field vector, an index into the first field vector is advanced or otherwise adjusted to a position within the next partitioned set to obtain a next value in the first field vector. This next value in the first field vector is then compared to the value in the second field vector. This continues as long as the value in the first field vector is less than the value in the second field vector.

When the value of the first field vector equals the value in the second field vector, the process has identified duplicate data that is then preferably stored in a common field vector. After storing the identified duplicate data, the index into the first field vector and the index into the second field vector are both advanced or otherwise adjusted to a position within the next partitioned set of their respective field vectors.

The process thus described may be viewed as a feedback control mechanism that adjusts the index into either of the arrays based on the difference between the values in the field vectors. In the embodiment described above, a positive difference generates an adjustment to the index of the second field vector whereas a negative difference generates an adjustment to the index of the first field vector. This process results in a linear relationship between the number of values in the field vectors and the number of computations (i.e., comparisons) required, as opposed to the geometric relationship associated with conventional methods.

The present invention may be extended to sorting mechanisms as well. In cases where a particular value must be inserted into a field vector (i.e., a record must be inserted into a database) based on an ordering of the values in the vector (e.g., alphabetically, numerically, etc.), a difference between the particular value and a value of one of the elements in the vector is computed. This difference is “fed back” to adjust the index into the vector to generate the next value from the vector. Using well-established methods of control theory, the index adjustments may be integrated to determine the proper location of the value to be inserted. In addition to the integrator, a proportional gain may be applied to the difference to establish a desired system performance, as would be apparent.

The present invention is now described with reference to FIGS. 10-12. FIG. 10 is a flow diagram for identifying duplicate data within a pair of field vectors. The field vectors may be from a single source such as raw data 210A (e.g., when comparing a Residential Address Field with a Mailing Address in a single database) or from multiple sources such as raw data 210A and raw data 210B (e.g., when comparing a Name Field between two databases).

For purposes of this description, the pair of field vectors are referred to as a first field vector (“FV1”) and a second field vector (“FV2”), respectively. Preferably, the data in these field vectors are base-40 numbers that represent alphanumeric data as described above. However, in some embodiments of the present invention, the data may exist in other forms as well.

In a step 1010, the first field vector is sorted in numerical order. In a step 1020, the second field vector is also sorted in numerical order. In one embodiment of the present invention, the vectors are sorted in increasing numerical order, although other embodiments of the present invention may sort the vectors in decreasing order, as would be apparent.

In a step 1030, partitioned sets within the first field vector having common values are identified. Likewise, in a step 1040, partitioned sets within the second field vector having common values are also identified. Steps 1010-1040 perform a similar function to the step of partitioning reference database 220 described above with reference to FIGS. 7 and 8. In some embodiments of the present invention, the field vectors may not include any partitioned sets, as the common values within each field vector may have been eliminated. However, in a preferred embodiment of the present invention, the common values within a particular field vector are maintained.

In a step 1050, a common value vector that identifies the common values between the first and second field vectors is determined, preferably using the partitioned sets. Step 1050 is described in further detail with reference to FIG. 11.

FIG. 11 is a flow diagram for identifying common values between a pair of field vectors. In a step 1110, three vector indices are initialized. A first vector index, I, is an index into the first field vector FV1; a second vector index, J, is an index into the second field vector FV2; and a third vector index, K, is an index into the common value vector (“CV”). As mentioned above, the common value vector includes the values shared by both first and second field vectors. Indices I and J are initialized to locate a first position in each of the first and second field vectors, respectively. Index K is initialized to locate a position for a next common value to be included in the common value vector.

In a decision step 1120, the present invention determines whether the value in the I-th position of the first field vector is greater than or equal to the value of the J-th position of the second field vector. If so, processing continues at a decision step 1130; otherwise, processing continues at a step 1170. Step 1170 is performed, effectively, when the value in the I-th position of the first field vector is less than the value of the J-th position of the second field vector. In step 1170, the first index I is adjusted to locate the beginning of the next partitioned set in the first field vector. After step 1170, processing continues at a decision step 1160.

In decision step 1130, the present invention determines whether the value in the I-th position of the first field vector is equal to the value of the J-th position of the second field vector. If so, processing continues at a step 1140; otherwise, processing continues at a step 1180. Step 1180 is performed, effectively, when the value in the I-th position of the first field vector is greater than the value of the J-th position of the second field vector. In step 1180, the second index J is adjusted to locate the beginning of the next partitioned set in the second field vector. After step 1180, processing continues at decision step 1160.

Step 1140 is performed, effectively, when the value in the I-th position of the first field vector is equal to the value of the J-th position of the second field vector. In step 1140, the value included in both the first and second field vectors is placed in the common value vector.

In a step 1150, the third index K is incremented to locate the position in the common value vector of the next common value to be identified. The first index I is adjusted to locate the beginning of the next partitioned set in the first field vector. The second index J is adjusted to locate the beginning of the next partitioned set in the second field vector.

In decision step 1160, the present invention determines whether additional partitioned sets exist in both the first field vector and the second field vector. If so, processing continues at step 1120. If no partitioned sets remain in either the first field vector or the second field vector, processing ends. When processing ends, the common value vector includes all the duplicate data identified between the first and second field vectors.
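For illustration, steps 1010 through 1180 may be sketched in a few lines. This is a simplified reading of the flow diagrams, not a definitive implementation: the function names are assumptions, the third index K is implicit in the length of the list `cv`, and the start of the next partitioned set is found by scanning past equal values rather than by consulting stored partition locations. The sample values are hypothetical.

```python
def next_partition(vector, index):
    """Index of the first element after the run of equal values at `index`."""
    value = vector[index]
    while index < len(vector) and vector[index] == value:
        index += 1
    return index


def common_values(fv1, fv2):
    """Return the values shared by two field vectors, in increasing order."""
    fv1, fv2 = sorted(fv1), sorted(fv2)        # steps 1010 and 1020
    i = j = 0                                  # step 1110 (indices I and J)
    cv = []                                    # common value vector (index K)
    while i < len(fv1) and j < len(fv2):       # decision step 1160
        if fv1[i] >= fv2[j]:                   # decision step 1120
            if fv1[i] == fv2[j]:               # decision step 1130
                cv.append(fv1[i])              # step 1140
                i = next_partition(fv1, i)     # step 1150
                j = next_partition(fv2, j)
            else:
                j = next_partition(fv2, j)     # step 1180
        else:
            i = next_partition(fv1, i)         # step 1170
    return cv


# Hypothetical field vectors; the repeated 13 forms one partitioned set.
shared = common_values([13, 8, 10, 9, 13, 15], [12, 8, 20, 9, 13])
```

Each pass through the loop advances at least one index, so after sorting, the number of comparisons grows with the combined length of the two vectors rather than with the product of their lengths.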

FIG. 12 illustrates an example of identifying duplicate data between field vectors according to the present invention. Steps 1010 and 1030 sort and partition field vector 1 (“FV1”) and steps 1020 and 1040 sort and partition a field vector 2 (“FV2”). The operation of step 1050 is now described with reference to steps 1110-1180 where traversal through steps 1120 to step 1160 and back to step 1120 is referred to as a “loop.”

In a first loop, the first element (i.e., 0-th position) of FV1 is compared with the first element of FV2. (This is illustrated in FIG. 12 as a line between FV1 and FV2 having arrows on both ends and annotated with 1.) In this example, a value ‘8’ of FV1 is compared with a value ‘8’ of FV2. Decision steps 1120 and 1130 determine that these values are equal and, in step 1140, the value ‘8’ is placed in the common value vector. (This is illustrated in FIG. 12 as a line between FV2 and the COMMON VALUE VECTOR having arrows on both ends and annotated with 1′.) Step 1150 adjusts the indices of both field vectors to point at the next partitioned set. Decision step 1160 determines that more partitioned sets exist in both field vectors and a second loop is started.

In the second loop, the next element of FV1 is compared with the next element of FV2. In this example, a value ‘9’ of FV1 is compared with a value ‘9’ of FV2. These values are again determined to be equal and the value ‘9’ is placed in the common value vector. As before, step 1150 adjusts both indices to point at the next partitioned sets in their respective field vectors. Decision step 1160 determines that more partitioned sets exist in both field vectors and a third loop is started.

In the third loop, the next element of FV1 is compared with the next element of FV2. In this example, a value ‘10’ of FV1 is compared with a value ‘12’ of FV2. Decision step 1120 determines that the value in FV1 is not greater than or equal to the value in FV2 and, in step 1170, the index to FV1 is adjusted to point at the next partitioned set therein. Decision step 1160 determines that more partitioned sets exist in both field vectors and a fourth loop is started.

In the fourth loop, the next element of FV1 is compared with the previous value of FV2. In this example, a value ‘12’ of FV1 is compared with the previously compared value of ‘12’ of FV2. Decision steps 1120 and 1130 determine that the values are equal and, in step 1140, the value ‘12’ is placed in the common value vector. Step 1150 adjusts both indices to point at the next partitioned sets in their respective field vectors. Decision step 1160 determines that more partitioned sets exist in both field vectors and a fifth loop is started.

In the fifth loop, the next element of FV1 is compared with the next value of FV2. In this example, a value ‘15’ of FV1 is compared with a value ‘18’ of FV2. Decision step 1120 determines that the value in FV1 is not greater than or equal to the value in FV2 and, in step 1170, the index to FV1 is adjusted to point at the next partitioned set therein. Because no more partitioned sets exist in FV1, processing ends.

In this example, five loops with a maximum of two comparisons per loop are required to identify three common values between the two field vectors. In a brute force method, 132 comparisons (12*11) are required.
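The loop count in this walkthrough can be checked with a short Python trace. FIG. 12 is not reproduced here, so the contents of `FV1` and `FV2` below are hypothetical: they are chosen only so that the partitioned sets match the values named in the five loops (8, 9, 10, 12, 15 and 8, 9, 12, 18) and so that the vector lengths are 12 and 11, consistent with the 12*11 brute force figure. The function name `trace_loops` is likewise an assumption.

```python
def trace_loops(fv1, fv2):
    """Count passes through the compare-and-adjust loop on sorted vectors.

    Returns (number of loops, list of common values).
    """
    i = j = loops = 0
    common = []
    while i < len(fv1) and j < len(fv2):
        loops += 1
        a, b = fv1[i], fv2[j]
        if a == b:
            common.append(a)
        # Skip the whole partitioned set on the side holding the smaller
        # value; on a match, skip the set on both sides.
        if a <= b:
            while i < len(fv1) and fv1[i] == a:
                i += 1
        if b <= a:
            while j < len(fv2) and fv2[j] == b:
                j += 1
    return loops, common


# Hypothetical sorted field vectors (12 and 11 elements).
FV1 = [8, 8, 8, 9, 9, 10, 10, 12, 12, 12, 15, 15]
FV2 = [8, 8, 9, 12, 12, 18, 18, 20, 20, 21, 21]

loops, common = trace_loops(FV1, FV2)
brute_force = len(FV1) * len(FV2)  # every element compared with every other
print(loops, common, brute_force)  # 5 [8, 9, 12] 132
```

With these assumed vectors the trace ends after the fifth loop because FV1 is exhausted, even though partitioned sets remain in FV2.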

Various embodiments of the present invention may be used for many different applications, some of which have been described and/or alluded to above. For example, in the application described above, the invention may be used to combine billing information collected from multiple sources to derive a distilled database in which related data records are recognized and duplicate and erroneous data records are eliminated. As suggested, this may be particularly useful in cases, for example, involving fraud. Typically, persons committing credit card fraud or other forms of retail fraud make minor changes to certain pieces of their personal information while leaving the majority of it the same. For example, oftentimes, digits in a social security number may be transposed or an alias may be used. Often, however, other information such as the person's address, date of birth, mother's maiden name, etc., is used identically. These types of fraud are readily identified by the present invention, even though they are difficult to identify by human analysis.

Other possible applications include uses in telemarketing, to compile a list of targeted individuals or addresses, or in mail-order catalogs, to reduce a number of catalogs sent to the same individual or family. Still another potential application is in the medical research or diagnostics fields, in which nucleotide sequences of Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in nucleic acids may be identified.

In other embodiments, the present invention may be used as a gatekeeper for a particular database at the outset to maintain integrity of the database from the very beginning, rather than achieving integrity in the database at a later date. In these embodiments, no raw data 210 is present and only new data 240 exists. Before new data 240 is added to the database, it is measured against distilled database 230 to determine whether new data 240 includes additional information or content. If so, only that new information or content is added to distilled database 230 by updating an existing record in distilled database 230 to reflect the new information or content as would be apparent.
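The gatekeeper behavior might be sketched in Python as follows. This is a simplified illustration only: the function `admit`, the dictionary layout, and matching records by an exact key are all assumptions made for the sketch, and the handling of a record with no existing match is likewise assumed rather than taken from the description above.

```python
def admit(distilled, key, new_record):
    """Add to distilled only the information that new_record contributes.

    distilled:  dict mapping a record key to a dict of field values
                (hypothetical stand-in for distilled database 230).
    new_record: dict of field values (hypothetical stand-in for new data 240).
    Returns the fields that were actually added.
    """
    existing = distilled.get(key)
    if existing is None:
        # Assumption: an unmatched record is stored as a new record.
        distilled[key] = dict(new_record)
        return dict(new_record)

    # Keep only fields the existing record lacks; fields that are already
    # populated are left unchanged in this sketch.
    added = {
        field: value
        for field, value in new_record.items()
        if value is not None and existing.get(field) is None
    }
    existing.update(added)
    return added


db = {"A1": {"name": "J. Smith", "dob": None}}
print(admit(db, "A1", {"name": "J. Smith", "dob": "1970-01-01"}))
print(db["A1"])
```

In this sketch a second submission of the same record adds nothing, which reflects the idea that new data is admitted only to the extent it carries additional content.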

While this invention has been described in a preferred embodiment, other embodiments and variations are within the scope of the following claims. For example, formatting process 300 may format data using different radices or other character sets, and may use various data structures. The data structures may represent multiple fields, and depending on the application, will represent a variety of fields. For example, in a credit application, fields may include an account status, an account number, and a legal status, in addition to personal information about the account holder. In a medical diagnostic application, fields may include various alleles or other genetic characteristics detected in tissue samples.

1. A computer-implemented method for identifying duplicate data between a first vector and a second vector comprising: sorting, by at least one computing processor, values in the first vector in a decreasing order; partitioning, by the at least one computing processor, the sorted values in the first vector into first sets, wherein at least one of the first sets includes a plurality of sorted values that have a common value; sorting, by the at least one computing processor, values in the second vector in said decreasing order; partitioning, by the at least one computing processor, the sorted values in the second vector into second sets, wherein at least one of the second sets includes a plurality of members that have a common value; comparing, by the at least one computing processor, a first sorted value at a first index in a first one of the first sets of the partitioned first vector with a second sorted value at a second index in a first one of the second sets of the partitioned second vector; adjusting, by the at least one computing processor, said first index to a next one of the first sets of the partitioned first vector if said first sorted value is greater than said second sorted value; adjusting, by the at least one computing processor, said second index to a next one of the second sets of the partitioned second vector if said second sorted value is greater than said first sorted value; and identifying, by the at least one computing processor, said first and second sorted values as duplicate data if said first sorted value is equal to said second sorted value.
2. A computer-implemented method for identifying duplicate data between a first vector and a second vector comprising: sorting, by at least one computing processor, values in the first vector in an increasing order; partitioning, by the at least one computing processor, the sorted values in the first vector into first sets, each of the first sets having at least one sorted value, all sorted values in each of the first sets having a common value, wherein at least one of the first sets includes a plurality of sorted values; sorting, by the at least one computing processor, values in the second vector in said increasing order; partitioning, by the at least one computing processor, the sorted values in the second vector into second sets, each of the second sets having at least one sorted value, all sorted values in each of the second sets having a common value, wherein at least one of the second sets includes a plurality of sorted values; comparing, by the at least one computing processor, a first sorted value at a first index in a first one of the first sets of the partitioned first vector with a second sorted value at a second index in a first one of the second sets of the partitioned second vector; adjusting, by the at least one computing processor, said first index to a next one of the first sets of the partitioned first vector if said first sorted value is less than said second sorted value; adjusting, by the at least one computing processor, said second index to a next one of the second sets of the partitioned second vector if said second sorted value is less than said first sorted value; and identifying, by the at least one computing processor, said first and second sorted values as duplicate data if said first sorted value is equal to said second sorted value.
3. The method of claim 2, wherein at least one of the second sets includes a plurality of sorted values.
4. The method of claim 1, further comprising storing said duplicate data.
5. The method of claim 2, further comprising storing said duplicate data.
6. A computer-implemented method comprising: sorting, by at least one computing processor, values in a first vector in an increasing order; partitioning, by the at least one computing processor, the sorted values in the first vector into first sets, each of the first sets having at least one sorted value, all sorted values in each of the first sets sharing a common value, wherein at least one of the first sets includes a plurality of sorted values; sorting, by the at least one computing processor, values in a second vector in said increasing order; partitioning, by the at least one computing processor, the sorted values in the second vector into second sets, each of the second sets having at least one sorted value, all sorted values in each of the second sets having a common value; comparing, by the at least one computing processor, a first sorted value at a first index in a first one of the first sets of the partitioned first vector with a second sorted value at a second index in a first one of the second sets of the partitioned second vector, wherein the first one of the first sets of the partitioned first vector includes a plurality of sorted values; when said first sorted value is less than said second sorted value, adjusting said first index to a next one of the first sets of the partitioned first vector; when said second sorted value is less than said first sorted value, adjusting said second index to a next one of the second sets of the partitioned second vector; and when said first sorted value is equal to said second sorted value, adjusting said first index to a next one of the first sets of the partitioned first vector and adjusting said second index to a next one of the second sets of the partitioned second vector.
7. The method of claim 6, wherein the first sorted value is one of the at least one sorted value in the first one of the first sets of the partitioned first vector, and wherein the second sorted value is one of the at least one sorted value in the second one of the second sets of the partitioned second vector.